Social Network Analysis Project

Nareed Hashem - Shiraz Fero

Table of Contents

1. Introduction

1.1. The Data
1.2. Research Question
1.3. Our Approach

2. Building the Network

2.1. Data Cleansing
2.2. Create the Network

3. Distributions

3.1. Degree Distribution
3.2. Betweenness Distribution

4. Our Analysis

4.1. Building the New Graph
4.2. Finding Most Central Sub-Category

5. Additional Analysis

5.1. Again, Central Sub-Category
5.2. Strongest Edge


1. Introduction

1.1. The Data

We chose the Cora citation network: a directed, unweighted network in which nodes represent scientific papers, and an edge from one node to another indicates that the first paper cites the second. In addition, each paper is classified into a category and a sub-category.

1.2. Research Question

What is the most cited sub-category that all other categories depend on?

Most central sub-category - the one whose papers are most often cited by papers from a different sub-category.

1.3. Our Approach

To answer the research question, we will build a new graph in which each node is the collection of articles from one sub-category. This new graph is directed and weighted, where the weight of an edge from sub-category A to sub-category B is the percentage of A's papers that cite at least one paper in B, multiplied by the percentage of B's papers that are cited by at least one paper in A.
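As a rough numeric sketch of this weight definition (the numbers below are illustrative, not taken from the Cora data):

```python
# Sketch of the planned edge weight between sub-categories A -> B.
# All counts here are made up for illustration.
papers_in_A = 100      # total articles in sub-category A
papers_in_B = 80       # total articles in sub-category B
a_citing_b = 25        # articles in A citing at least one article in B
b_cited_by_a = 20      # articles in B cited by at least one article in A

percent_out = a_citing_b / papers_in_A     # 0.25: share of A that cites B
percent_in = b_cited_by_a / papers_in_B    # 0.25: share of B cited by A
weight = percent_out * percent_in          # how strongly A depends on B
print(weight)  # 0.0625
```

The product penalizes one-sided relationships: both a large share of A's papers must cite B and a large share of B's papers must be cited for the edge to be heavy.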

Imports

In [127]:
# Import packages
import numpy as np
import os
import pandas as pd
import networkx as nx
import time
from random import sample
import sys
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import plotly
import plotly.graph_objects as go
import plotly.express as px

2. Building the Network

2.1. Data Cleansing

Get data from csv's:

In [3]:
# edges.csv has 91500 rows, nodes.csv has 23166 rows
edges = pd.read_csv('./data/edges.csv', delimiter=' ')
nodes = pd.read_csv('./data/nodes.csv', delimiter=' ')
nodes = nodes[['network_id', 'node_id']]

# get ids, look up each node id's category, and combine them in the names df
names = pd.read_csv('./data/ent.subelj_cora_cora.id.csv', delimiter=' ')
cat = pd.read_csv('./data/ent.subelj_cora_cora.class.csv', delimiter=' ')
names['category'] = cat[['category']]

result = pd.merge(nodes, names, on='node_id')
result = result.drop(['node_id'], axis=1)

# split the "/Category/Sub_category" strings into separate columns
cat_df = result.copy()
cat_df['category'] = cat_df['category'].astype(str)

for index, row in cat_df.iterrows():
    parts = row['category'].split('/')
    cat_df.loc[index, 'category'] = parts[1].replace('_', ' ')
    cat_df.loc[index, 'sub_cat'] = parts[2].replace('_', ' ')

The final data consists of two dataframes: one for the edges, and one that maps each node to its category and sub-category. For example:

In [130]:
edges.head(2)
Out[130]:
cites cited
0 20128 6078
1 22236 10436
In [5]:
cat_df.head(2)
Out[5]:
network_id category sub_cat
0 1 Databases Performance
1 2 Human Computer Interaction Cooperative

Full network visualization: FullNetwork.png (legend: FullLegend.jpeg)

2.2. Create the Network

From our data, create a directed graph using the networkx library:

In [6]:
edges_tuple = [tuple(x) for x in edges.to_numpy()]
DG = nx.DiGraph()
DG.add_edges_from(edges_tuple)

Some of the network's properties:

In [7]:
print(nx.info(DG))
print("Is the Graph directed? " + str(DG.is_directed()))
print("Graph Density is: " + str(nx.density(DG)))
print("Average Clustering: " + str(nx.average_clustering(DG, nodes=None, weight=None, count_zeros=True)))
Name: 
Type: DiGraph
Number of nodes: 23166
Number of edges: 91500
Average in degree:   3.9498
Average out degree:   3.9498
Is the Graph directed? True
Graph Density is: 0.00017050524281260305
Average Clustering: 0.14601503382808564
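The density figure above can be sanity-checked by hand: for a directed graph with n nodes and m edges, density is m / (n(n-1)), and the average in- and out-degree are both m / n:

```python
# Sanity-check the reported network properties from n and m alone.
n, m = 23166, 91500          # nodes and edges reported above
density = m / (n * (n - 1))  # directed graph: n*(n-1) possible edges
avg_degree = m / n           # average in-degree = average out-degree
print(density)     # ~0.0001705, matching nx.density(DG)
print(avg_degree)  # ~3.9498, matching the averages above
```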

3. Distributions

3.1. Degree Distribution

Since our network is directed, we'll calculate both the in-degree and the out-degree of each node; we expect a power-law distribution in both.

In [133]:
'''
Given the list of (node, value) pairs we get from networkx, this method
turns it into a df with two columns: the measure's value and its frequency.
'''
def dictionary_to_df(pairs, measure):
    values = [val for node, val in pairs]
    unique = list(set(values))

    # count how many nodes share each unique value
    count = [values.count(v) for v in unique]

    dfout = pd.DataFrame(list(zip(unique, count)),
                         columns=[measure, 'Freq'])
    return dfout
In [136]:
import plotly.io as pio
pio.renderers.default = "notebook"
In [137]:
%matplotlib inline

out_deg = DG.out_degree()
in_deg = DG.in_degree()
out_df = dictionary_to_df(out_deg, 'Degree')
in_df = dictionary_to_df(in_deg, 'Degree')

# Plot the in and out degree distribution
fig =  plotly.subplots.make_subplots(rows=1, cols=2, horizontal_spacing=0.1,
                                     subplot_titles=("In-Degree Distribution","Out-Degree Distribution"),
                                                     specs=[[{"type": "xy"},{"type": "xy"}]])
fig.add_trace(
    go.Scatter(x=in_df['Degree'], y=in_df['Freq'],marker_symbol='hexagon2', mode="markers+text", 
               marker=dict(size=12,color='rgba(135, 206, 250, 0.7)', line=dict(width=1, color='DarkSlateGrey'))), row=1, col=1)
fig.add_trace(
go.Scatter(x=out_df['Degree'], y=out_df['Freq'],marker_symbol='hexagon2', mode="markers+text", 
               marker=dict(size=12,color='rgba(135, 206, 250, 0.7)', line=dict(width=1, color='DarkSlateGrey'))), row=1, col=2)
fig.update_xaxes(title_text="Degree")
fig.update_yaxes(title_text="Frequency")
fig.update_layout(height=500, width=1000,showlegend=False)
fig.show()

Conclusions

As expected, both the in- and out-degree distributions follow a power law, meaning small values are extremely common. Notice that the frequency of in-degree 0 is much higher than that of out-degree 0, which makes sense: most articles cite at least one other article, but a large number of articles have not been cited by anyone so far.
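A quick way to sanity-check the power-law claim is to fit a straight line to the distribution on log-log axes; the slope estimates the exponent. A minimal sketch with synthetic data of known exponent (in the notebook, `deg` and `freq` would be `in_df['Degree']` and `in_df['Freq']`, with zero degrees excluded since log of 0 is undefined):

```python
import numpy as np

# For a power law Freq ~ Degree**(-alpha), log(Freq) is linear in
# log(Degree), so a line fit on log-log axes recovers alpha.
deg = np.arange(1, 101)
freq = 1000.0 * deg ** -2.0   # synthetic data with true exponent 2

slope, intercept = np.polyfit(np.log10(deg), np.log10(freq), 1)
print(slope)  # ~ -2.0: the fit recovers the true exponent
```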

3.2. Betweenness Distribution

In [ ]:
bet = nx.edge_betweenness_centrality(DG)
betw_df = dictionary_to_df(bet.items(), 'Betweenness')
betw_df['Freq'] = np.log10(betw_df['Freq'])
In [132]:
fig = px.scatter(betw_df, x="Betweenness", y="Freq")
fig.update_yaxes(title_text="Log Frequency")
fig.update_layout(height=500, width=700,showlegend=False, title='Betweenness Distribution')
fig.show()
In [123]:
# get the pair of nodes connected by the edge with the highest betweenness:
all_bet = []
for node,deg in bet.items():
    all_bet.append([node, deg])
bet_df = pd.DataFrame(all_bet, columns=['node_id', 'betweenness'])
bet_df = bet_df.sort_values(by=['betweenness'], ascending=False)
bet_df.head(2)
Out[123]:
node_id betweenness
9212 (9814, 15574) 0.022596
10069 (20584, 15430) 0.013577

Conclusions

The edge with the highest betweenness goes from node 9814 to node 15574:

SOURCE (node 9814): sub-category Compression, from Encryption and Compression

TARGET (node 15574): sub-category Memory Management, from Operating Systems

An edge with a high betweenness centrality score acts as a bridge-like connector between two parts of a network; removing it may disrupt the shortest-path communication between many pairs of nodes.
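This bridge intuition is easy to demonstrate on a toy graph: two cliques joined by a single edge, whose betweenness dwarfs all others. A small sketch (undirected for simplicity, unlike our citation network):

```python
import networkx as nx

# Two 4-node cliques joined by a single bridge edge:
# barbell_graph(4, 0) builds cliques {0..3} and {4..7} connected by (3, 4).
G = nx.barbell_graph(4, 0)
bet = nx.edge_betweenness_centrality(G)
bridge = max(bet, key=bet.get)
print(bridge)  # (3, 4): every shortest path between the cliques crosses it

# Removing the bridge disconnects the two halves of the network.
G.remove_edge(*bridge)
print(nx.is_connected(G))  # False
```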


4. Our Analysis

We aim to find the sub-category that is cited the most. As stated before, we'll build a new graph where:
Nodes: Sub-categories.
Edges: citations between sub-categories.
Edge Weight: Strength in which one sub-category depends on the other.

4.1. Building the New Graph

Create nodes
In [8]:
# Create a df containing each sub-category, its id, its parent category, and its article count
unique_subcat = cat_df.sub_cat.unique()
subcat_df = pd.DataFrame(columns=['subcat_id', 'count', 'subcat', 'category'])
i = 0
for subcat in unique_subcat:
    i = i + 1
    rows_list = []
    #get list of all nodes in category
    temp_df = cat_df.loc[cat_df['sub_cat'].isin([subcat])]
    rows_list.append([i, temp_df.shape[0] ,subcat , temp_df.iloc[0]['category']])
    temp_df = pd.DataFrame(rows_list, columns=['subcat_id', 'count', 'subcat', 'category'])
    subcat_df = pd.concat([subcat_df, temp_df])
subcat_df.head(3)
Out[8]:
subcat_id count subcat category
0 1 156 Performance Databases
0 2 91 Cooperative Human Computer Interaction
0 3 246 Java Programming

Here we have all 62 sub-categories, each with an id, its parent category, and a count of the articles in it.

Create edges
In [9]:
# Here we're adding cited and cites sub-category id to the edges df
new_edges = pd.merge(edges, cat_df.drop(columns='category'),left_on='cites', right_on='network_id')
new_edges = pd.merge(new_edges.drop(columns = 'network_id'), subcat_df.drop(columns=['category','count']),left_on='sub_cat', right_on='subcat')
new_edges = new_edges.rename(columns={"subcat_id": "cites_subcat"})

new_edges = pd.merge(new_edges.drop(columns=['sub_cat', 'subcat']), cat_df.drop(columns='category'),left_on='cited', right_on='network_id')
new_edges = pd.merge(new_edges.drop(columns = 'network_id'), subcat_df.drop(columns=['category','count']),left_on='sub_cat', right_on='subcat')
new_edges = new_edges.rename(columns={"subcat_id": "cited_subcat"})
new_edges = new_edges.drop(columns=['sub_cat', 'subcat'])

# Remove rows where cites and cited are in the same sub-category
new_edges = new_edges[new_edges['cites_subcat'] != new_edges['cited_subcat']]
        
new_edges.head(3)
Out[9]:
cites cited cites_subcat cited_subcat
4 15915 16609 26 30
5 18504 16609 26 30
38 19022 14594 17 30
Calculate weight on edges

For each edge we calculate two percentages. Given sub-category A citing sub-category B, we first take the percentage of articles in A that cite at least one article in B, then the percentage of articles in B that are cited by A. The product of these two percentages is the edge's weight; it indicates how much A depends on B.

In [19]:
# The first calculation - number of articles in a sub-category that cite articles from another sub-category.
unique_subcat = new_edges.cites_subcat.unique()
rows_list = []
for source in unique_subcat:
    #get list of all nodes in category
    bycites = new_edges.loc[new_edges['cites_subcat'].isin([source])]
    for target in unique_subcat:
        bycited = bycites.loc[bycites['cited_subcat'].isin([target])]
        bycited = bycited.drop_duplicates(subset=['cites'])
        rows_list.append([source, target, bycited.shape[0]])
out_df = pd.DataFrame(rows_list, columns=['source' ,'target', 'number'])
out_df.head(3)
Out[19]:
source target number
0 26 26 0
1 26 17 158
2 26 53 5
In [20]:
# The second calculation - number of articles in a sub-category that are cited by articles from another sub-category.
rows_list = []
for target in unique_subcat:
    #get list of all nodes in category
    bycited = new_edges.loc[new_edges['cited_subcat'].isin([target])]
    for source in unique_subcat:
        bycites = bycited.loc[bycited['cites_subcat'].isin([source])]
        bycites = bycites.drop_duplicates(subset=['cited'])
        rows_list.append([source, target, bycites.shape[0]])
in_df = pd.DataFrame(rows_list, columns=['source', 'target', 'number'])
in_df.head(3)
Out[20]:
source target number
0 26 26 0
1 17 26 88
2 53 26 0
In [22]:
# turn the counts into percentages, take their product, and create the final edges df with weights.

indf = in_df.copy()
outdf = out_df.copy()

indf = pd.merge(subcat_df, indf ,left_on='subcat_id', right_on='target')
indf['percent_in'] = (indf['number']/indf['count'])
indf = indf.drop(columns=['count', 'subcat_id', 'subcat', 'category', 'number'])
indf = indf.loc[~(indf['percent_in']==0)]

outdf = pd.merge(subcat_df, outdf ,left_on='subcat_id', right_on='source')
outdf['percent_out'] = (outdf['number']/outdf['count'])
outdf = outdf.drop(columns=['count', 'number', 'subcat', 'category', 'subcat_id'])
outdf = outdf.loc[~(outdf['percent_out']==0)]

merged = pd.merge(indf, outdf, left_on=['source','target'], right_on = ['source','target'])
merged['weight'] = merged['percent_in']*merged['percent_out']
merged = merged.drop(columns=['percent_in', 'percent_out'])

merged = merged.sort_values(by=['weight'], ascending=False)

merged.head(3)
Out[22]:
source target weight
1540 40 41 0.107948
240 1 7 0.0698529
239 43 7 0.069088
Create the Network
In [23]:
G = nx.DiGraph()

for index, row in merged.iterrows():
    G.add_edge(row['source'],row['target'],weight=row['weight'])
In [24]:
print("Basic information about the new network:")
print(nx.info(G))
print("Is the Graph directed? " + str(G.is_directed()))
print("Graph Density is: " + str(nx.density(G)))
print("Average Clustering: " + str(nx.average_clustering(G, nodes=None, weight=None, count_zeros=True)))
Basic infomation about the new network:
Name: 
Type: DiGraph
Number of nodes: 62
Number of edges: 2140
Average in degree:  34.5161
Average out degree:  34.5161
Is the Graph directed? True
Graph Density is: 0.5658381808566896
Average Clustering: 0.697772949549231

NewGLegend.png

NewG.png

4.2. Finding Most Central Sub-Category

Calculate the weighted in-degree of each node (sub-category), i.e. the sum of the weights on its incoming edges.
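On a toy directed graph with hypothetical weights, networkx's `in_degree(weight='weight')` does exactly this summation:

```python
import networkx as nx

# Toy example: sub-categories 1 and 2 both cite sub-category 3,
# which in turn cites 1. Weights are made up for illustration.
H = nx.DiGraph()
H.add_edge(1, 3, weight=0.25)
H.add_edge(2, 3, weight=0.5)
H.add_edge(3, 1, weight=0.05)

# in_degree(weight='weight') sums the weights of incoming edges per node,
# so node 3 gets 0.25 + 0.5 = 0.75, and node 2 (no in-edges) gets 0.
print(dict(H.in_degree(weight='weight')))
```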

In [25]:
subcat_indegree = pd.DataFrame(G.in_degree(weight='weight'), columns=['subcat_id', 'in_degree'])
subcat_indegree = subcat_indegree.sort_values(by=['in_degree'], ascending=False)
subcat_indegree = pd.merge(subcat_indegree, subcat_df, on='subcat_id')
subcat_indegree.head(2)
Out[25]:
subcat_id in_degree count subcat category
0 12 0.289617 821 Distributed Operating Systems
1 28 0.245358 610 Protocols Networking
Most central sub-category: Distributed, from Operating Systems

Since we can, we'll also find the sub-category with the highest weighted out-degree, i.e. the one that relies most on the others.

In [26]:
subcat_outdegree = pd.DataFrame(G.out_degree(weight='weight'), columns=['subcat_id', 'out_degree'])
subcat_outdegree = subcat_outdegree.sort_values(by=['out_degree'], ascending=False)
subcat_outdegree = pd.merge(subcat_outdegree, subcat_df, on='subcat_id')
subcat_outdegree.head(2)
Out[26]:
subcat_id out_degree count subcat category
0 4 0.211837 921 Memory Management Operating Systems
1 43 0.204827 149 Query Evaluation Databases

The result is Memory Management, also from Operating Systems.


5. Additional Analysis

5.1. Again, Central Sub-Category

We've found the most central sub-category in the whole network (all 62 sub-categories); now we want to check whether it is also the most central sub-category within its own category. That is, if we run the same analysis as before but only on the sub-categories of Operating Systems, will we get the same answer?

In [44]:
#Get all sub-categories that belong to Operating Systems
os_nodes = subcat_df.loc[subcat_df['category'] == 'Operating Systems']
#Get all edges that are within OS
os_edges = new_edges.loc[new_edges['cites_subcat'].isin(os_nodes['subcat_id']) & new_edges['cited_subcat'].isin(os_nodes['subcat_id'])]
In [53]:
# The first calculation - number of articles in a sub-category that cite articles from another sub-category.
unique_subcat = os_edges.cites_subcat.unique()
rows_list = []
for source in unique_subcat:
    # get all edges whose citing article is in this sub-category
    bycites = os_edges.loc[os_edges['cites_subcat'].isin([source])]
    for target in unique_subcat:
        bycited = bycites.loc[bycites['cited_subcat'].isin([target])]
        bycited = bycited.drop_duplicates(subset=['cites'])
        rows_list.append([source, target, bycited.shape[0]])
os_out = pd.DataFrame(rows_list, columns=['source', 'target', 'number'])
# normalize by the citing sub-category's article count, as in section 4
os_out = pd.merge(os_nodes, os_out, left_on='subcat_id', right_on='source')
os_out['percent_out'] = (os_out['number']/os_out['count'])
os_out = os_out.drop(columns=['count', 'subcat_id', 'subcat', 'category', 'number'])
os_out = os_out.loc[~(os_out['percent_out']==0)]

# The second calculation - number of articles in a sub-category that are cited by articles from another sub-category.
rows_list = []
for target in unique_subcat:
    # get all edges whose cited article is in this sub-category
    bycited = os_edges.loc[os_edges['cited_subcat'].isin([target])]
    for source in unique_subcat:
        bycites = bycited.loc[bycited['cites_subcat'].isin([source])]
        bycites = bycites.drop_duplicates(subset=['cited'])
        rows_list.append([source, target, bycites.shape[0]])
os_in = pd.DataFrame(rows_list, columns=['source', 'target', 'number'])
# normalize by the cited sub-category's article count, as in section 4
os_in = pd.merge(os_nodes, os_in, left_on='subcat_id', right_on='target')
os_in['percent_in'] = (os_in['number']/os_in['count'])
os_in = os_in.drop(columns=['count', 'number', 'subcat', 'category', 'subcat_id'])
os_in = os_in.loc[~(os_in['percent_in']==0)]

merged_os = pd.merge(os_out, os_in, left_on=['source','target'], right_on = ['source','target'])
merged_os['weight'] = merged_os['percent_in']*merged_os['percent_out']
merged_os = merged_os.drop(columns=['percent_in', 'percent_out'])

merged_os = merged_os.sort_values(by=['weight'], ascending=False)

merged_os.head(3)
Out[53]:
source target weight
3 4 12 0.0431533
4 23 12 0.0403763
0 12 4 0.0366598

Build the graph and calculate the in-degree for each node:

In [54]:
OS_G = nx.DiGraph()

for index, row in merged_os.iterrows():
    OS_G.add_edge(row['source'], row['target'], weight=row['weight'])
    
os_indegree = pd.DataFrame(OS_G.in_degree(weight='weight'), columns=['subcat_id', 'in_degree'])
os_indegree = os_indegree.sort_values(by=['in_degree'], ascending=False)
os_indegree = pd.merge(os_indegree, os_nodes, on='subcat_id')
os_indegree
Conclusions

As we expected, the most central sub-category among the Operating Systems sub-categories is the same one that is most central in the whole network: Distributed.

5.2. Strongest Edge

Check which sub-category relies most on another sub-category, i.e. find the edge with the highest weight.

In [124]:
temp = merged.copy()

temp = pd.merge(temp, subcat_df.drop(columns='count'), left_on='source', right_on='subcat_id')
temp = temp.rename(columns={"subcat": "source_name", "category":"source_cat"})
temp = pd.merge(temp.drop(columns='subcat_id'), subcat_df.drop(columns='count'), left_on='target', right_on='subcat_id')
temp = temp.rename(columns={"subcat": "target_name", "category":"target_cat"})
temp = temp.sort_values(by=['weight'], ascending=False)
temp = temp.drop(columns = 'subcat_id')
temp.head(2)
Out[124]:
source target weight source_name source_cat target_name target_cat
0 40 41 0.107948 Filtering Information Retrieval Retrieval Information Retrieval
253 1 7 0.0698529 Performance Databases Relational Databases

The edge with the highest weight goes from Filtering (Information Retrieval) to Retrieval (Information Retrieval), meaning the former relies on the latter more strongly than along any other edge in the network.